1.0 INTRODUCTION


The accurate prediction of international events has been a major
goal of researchers for some time. An approach relying on a
Bayesian Markov-renewal model has been developed recently under
sponsorship of the Defense Advanced Research Projects Agency
(DARPA) by Drs. George Duncan and Brian Job (Duncan and Job,
1977; Duncan, 1977; Job, 1977) and implemented on a DARPA
PDP-11/70 computer system by Mr. James Allen of Response
Resource, Inc. An analysis of the general approach was
conducted earlier, but without the benefit of actual data
(Chinnis et al., 1981).

An opportunity to evaluate the forecasting system (referred to
as "PREDICT") arose in June of 1981 when the Defense Intelligence
Agency (DIA) agreed to develop and use a number of
PREDICT-implemented models for a period of several months, with
analysts providing daily inputs. An experimental test was planned
which would permit a comparison of the direct probabilistic
forecasts made by the analysts each day with the probabilistic
forecasts made by the computer model. The actual models were
constructed by the analysts under the supervision of Dr. Lou
Johnson of Courtland International. DSC's primary role was to
analyze the resulting data and produce the evaluation presented
in this report.


2.0  EXPERIMENTAL  DESIGN 

Six models were constructed, each of a different world situation.
Each model was constructed by a different analyst. That
analyst assisted in defining five states representing a
mutually exclusive and reasonably exhaustive partitioning of a
continuum ranging from relatively peaceful (state 1) to relatively
war-like (state 5). The same analyst also provided the
required model inputs, such as transition waiting times. Finally,
the analyst provided on a daily basis an evaluation of the
state actually observed on that day (which was supplied to the
computer model).

In addition to the assessment of the current state on each day,
each analyst made direct, unaided forecasts of the probability
of observing each of the five states five, ten, and thirty days
in the future. These direct, unaided forecasts were made
without any knowledge of the corresponding computer-generated
forecast.
3.0 EVALUATION APPROACH

The  evaluation  of  a  forecast,  let  alone  a  forecasting  method,  is 
far  from  straightforward.  Ideally,  the  penalty  associated  with 
assigning  non-zero  probability  to  an  event  (or  state)  which  does 
not  occur  should  depend  upon  the  consequences  of  decisions  made 
based  upon  the  forecast.  For  obvious  practical  reasons  such  an 
evaluation  would  be  impossible  in  most  settings. 

3.1 Ranked Probability Score

A number of reasonable approaches to forecast evaluation are
available, however, and three were selected for the present
situation. Of primary interest is the ranked probability score
(RPS), originally developed to evaluate weather forecasts
(Murphy, 1970). The RPS assigns an increasing penalty to the
probability assigned to states progressively distant from the
actual observed state. Technical details are concisely presented
in Murphy (1970). Essentially, the RPS is appropriate for
estimating the expected utility of a forecast when the distinct
states can be placed along a continuum representing the severity
of some subject of interest (weather originally, and political
unrest in the present case).

As used here, the RPS assigned to each forecast of five state
probabilities ranges from a best score of one (when a categorical
forecast of the correct state is made) to a worst score of
zero (when a categorical forecast of state 1 or state 5 is made
and the opposite state is later observed). Note that while the
maximum RPS is one regardless of which state occurs, the minimum
RPS depends upon the state which occurs. In the present context,
this reflects the reality that the degree to which a
forecast can be a bad one is limited when moderate outcomes
occur (such as state 3) but is severe when extreme outcomes
actually occur (such as state 5).
In mathematical terms, the RPS when state j occurs and r represents
the vector of probability estimates across states is
designated RPS_j(r); its form follows Murphy (1970).
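The defining expression is not reproduced at this point, so the following is only a sketch under a stated assumption: it uses the form of the ranked probability score usually attributed to Murphy (1970), scaled so that one is best and zero is worst. The function name `ranked_probability_score` is hypothetical. The assumed form does agree with the properties stated in this report: the best and worst scores described above, the two-state reduction of Section 3.2, and the uniform-forecast reduction of Section 4.2.

```python
from itertools import accumulate


def ranked_probability_score(r, j):
    """Score forecast r (probabilities for states 1..K) when state j occurs.

    Assumed form, after Murphy (1970), scaled so 1 is best and 0 is worst:
        RPS_j(r) = 3/2
                   - 1/(2(K-1)) * sum_{i=1}^{K-1} [C_i^2 + (T - C_i)^2]
                   - 1/(K-1)    * sum_{i=1}^{K}   |i - j| * r_i
    where C_i = r_1 + ... + r_i and T = r_1 + ... + r_K (normally 1).
    """
    k = len(r)
    total = sum(r)
    cumulative = list(accumulate(r))[:-1]  # C_1 .. C_{K-1}
    # Penalty for spreading probability across the ordered states.
    spread = sum(c * c + (total - c) ** 2 for c in cumulative)
    # Probability-weighted distance from the observed state j.
    distance = sum(abs(i - j) * p for i, p in enumerate(r, start=1))
    return 1.5 - spread / (2 * (k - 1)) - distance / (k - 1)
```

Under this assumed form, a categorical forecast of the correct state scores 1, a categorical forecast of state 1 when state 5 occurs scores 0, and a categorical forecast of state 1 when state 3 occurs scores 0.5, which illustrates how the minimum depends on the observed state.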
3.2 Brier Score


Although the political models constructed for the present evaluation
are characterized by five states of increasing severity, as
required by the RPS method, an argument might be made that the
nature of the particular models is such that only the uppermost
state is of real concern. In other words, it is possible that
the RPS treatment of probabilities assigned to the various
states is inappropriate because of a discontinuity in terms of
costs of errors associated with state 5: correctly predicting
the "war" is all that matters. For such a case, an appropriate
measure of forecast quality is the Brier score, usually referred
to simply as the probability score, PS. The PS, as used here,
is equivalent to the RPS for the special case of a two-state
model (state 5 and not state 5). Thus the best possible PS is
one and is achieved by assessing a probability of one on the
state which actually occurs, and the worst possible PS is zero
and is achieved by assessing a probability of one (a "categorical"
forecast) on the wrong state. Note that unlike the RPS for
five-state models, the minimum PS does not depend upon which
state actually occurs.
Mathematically, when state j occurs, the PS used here can be
defined as

$$ PS_j(r) = 1 - r_i^2 \qquad (i \neq j), $$

where r_i is the probability assigned to the state which does
not occur.
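As a minimal sketch of the two-state score just defined, the function below collapses a forecast to "state 5" versus "not state 5". The function and argument names are hypothetical, and the sketch assumes the two-state probability is simply the probability the forecast placed on state 5.

```python
def probability_score(p_state5, state5_occurred):
    """PS = 1 - (probability placed on the outcome that did not occur)^2.

    p_state5        -- forecast probability of state 5 (0.0 to 1.0)
    state5_occurred -- True if state 5 was the observed state
    """
    # Probability mass on the wrong side of the two-state split.
    wrong = (1.0 - p_state5) if state5_occurred else p_state5
    return 1.0 - wrong ** 2
```

A categorical forecast of the outcome that occurs scores 1, a categorical forecast of the wrong outcome scores 0, and the range is the same whichever outcome occurs.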

3.3 Mean Absolute Error

A third measure of forecast quality which is simple to comprehend,
if less appropriate to use, is the mean absolute error
(MAE). This is simply the probability-weighted distance, in
terms of number of states, by which the forecast is in error.

For example, estimating a probability of one on state 3 when
state 4 occurs gives an MAE of one; a probability of one on
state 3 when state 5 occurs gives an MAE of two; a probability
of 0.5 on state 1 and 0.5 on state 2 when state 3 occurs gives
an MAE of 1.5, and so on.

Mathematically, when state j occurs, the MAE can be defined as

$$ MAE_j(r) = \sum_{i=1}^{5} |i - j| \, r_i . $$

Note that MAE ranges from a best score of zero for a categorical
forecast of the correct state to a worst score of four for a
categorical forecast four states distant from the correct state.
MAE, like the RPS, has a worst possible score which depends upon
the actual state. Its use in the present evaluation is to assist
our intuition as to the magnitude and importance of any
differences in aided versus unaided forecasts.
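The MAE definition and the worked examples above can be checked with a short sketch. The function names are hypothetical; states are numbered 1 through 5 as in the text.

```python
def mean_absolute_error(r, j):
    """Probability-weighted distance, in states, between forecast r and state j."""
    return sum(abs(i - j) * p for i, p in enumerate(r, start=1))


def worst_mae(j, n_states=5):
    """Largest possible MAE when state j occurs: all mass on the farthest state."""
    return max(j - 1, n_states - j)
```

For instance, `mean_absolute_error([0.5, 0.5, 0, 0, 0], 3)` reproduces the 1.5 of the third example, and `worst_mae` shows the dependence on the observed state: four when state 1 or state 5 occurs, but only two when state 3 occurs.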
4.0 RESULTS


The data employed in this evaluation are presented in raw form
in the Appendices. In this section, the summary data are presented
for RPS, PS, and MAE as obtained for each of three forecast
periods and six models. Although the computer models were run
for every day (by supplying the observed states when analysts
became free to provide them), analyst absences, and the need to
compare aided and unaided forecasts only when both were
available, meant that gaps occur at the same points in both the
aided and unaided data.

4.1  Appropriateness  of  the  Models  for  Evaluation 

In order to achieve a good opportunity to compare aided (PREDICT
computer-generated) and unaided forecasts it is necessary to
have (1) sufficient data to obtain precise estimates of the
various scores, (2) appropriate model difficulty, and (3) a
reasonable number of state changes. Appropriate model difficulty
means that the forecasting task was neither so difficult
that equiprobable forecasts result (0.20 on each state each day)
nor so easy that near-perfect forecasts were always produced.
State changes are necessary to assure that the forecasters can
adapt to such changes appropriately (and because non-trivial
models should undergo state changes over time). For the six
models (named Alpha through Lambda), the total number of days on
which forecasts could be compared and the distribution of those
observations across the five model states are shown in Table 1.
Table 1: Distribution of Observations Across States

                        State
Model        1     2     3     4     5    Total

Alpha        0     3    25    49    38     115
Beta         0     0    83     0     0      83
Gamma        0    57    77     0     0     134
Delta      128     0     0     0     0     128
Epsilon      0     0     0   132     0     132
Lambda       0   107     6     0     0     113


From  the  table,  it  is  clear  that  only  one  model  (Alpha)  exhibits 
much  coverage  of  the  five  states.  Three  of  the  models  (Beta, 
Delta,  Epsilon)  remain  fixed  in  the  initial  state  throughout  the 
four  to  five  month  experiment.  For  those  models,  examination  of 
the  data  reveals  that  both  the  aided  and  unaided  forecasts 
placed  exactly  or  very  nearly  100%  probability  on  the  fixed 
state;  in  other  words,  these  problems  were  too  easy  to  generate 
useful  evaluation  data.  It  should  be  noted  that  these  models 
were  probably  selected  as  representing  likely  areas  of  change 
and  intelligence  interest  which  never  materialized. 
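The coverage argument can be made concrete with a small sketch that reads the counts of Table 1 and lists the states each model actually visited. The names `TABLE_1` and `states_visited` are hypothetical; the counts are copied from the table.

```python
# Days observed in states 1..5 for each model, copied from Table 1.
TABLE_1 = {
    "Alpha":   [0, 3, 25, 49, 38],
    "Beta":    [0, 0, 83, 0, 0],
    "Gamma":   [0, 57, 77, 0, 0],
    "Delta":   [128, 0, 0, 0, 0],
    "Epsilon": [0, 0, 0, 132, 0],
    "Lambda":  [0, 107, 6, 0, 0],
}


def states_visited(counts):
    """Return the state numbers (1-based) with at least one observation."""
    return [state for state, n in enumerate(counts, start=1) if n > 0]


# Models that never left a single state during the experiment.
fixed_models = [m for m, c in TABLE_1.items() if len(states_visited(c)) == 1]

for model, counts in TABLE_1.items():
    print(model, states_visited(counts), sum(counts))
```

Only Alpha visits four of the five states; `fixed_models` picks out Beta, Delta, and Epsilon, the three models the text describes as remaining in one state.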


4.2 Comparison of Evaluation Scores


The  average  RPS,  PS,  and  MAE  scores  for  the  Alpha  model  are 
shown  in  Table  2.  Shown  also  are  the  standard  errors  of  the 
mean  RPS  and  PS,  and  the  number  of  observations  in  each  state. 
All  data  are  broken  down  by  time  period  of  forecast. 
Table 2: Summary Data for Model Alpha

                                   Forecast Period
                          5-day              10-day              30-day
                      aided  unaided     aided  unaided     aided  unaided

mean RPS              0.888   0.914      0.894   0.919      0.892   0.873
std error of mean     0.008   0.009      0.005   0.008      0.010   0.009
mean PS               0.739   0.822      0.717   0.785      0.822   0.696
std error of mean     0.015   0.018      0.019   0.021      0.014   0.029
mean MAE              0.825   0.748      0.829   0.747      0.878   0.950

distrib of observations:
  state 1                  0                  0                  0
  state 2                  2                  0                  1
  state 3                 12                  9                  6
  state 4                 35                 36                 21
  state 5                 35                 34                 31
  Total                   84                 79                 59

From the table, two conclusions can be drawn immediately.
First, both the unaided and aided forecasts do quite well according
to all measures computed. At the most intuitive level, the
mean absolute error is in all cases less than one state: forecasts
are, on the average, always less than one state distant
from the true state. Second, the aided forecast is not as good
as the unaided forecast in the case of five- and ten-day forecasts,
but is superior to the unaided forecast in the case of
30-day forecasts. The differences appear to be large relative
to the precision of the measures as indicated by the standard
errors of the means.

Further statistical analysis of the data is made extremely
difficult by a number of factors. First, it is not possible to
compare without qualification the scores obtained for the different
forecast periods. This is because forecasts produced on the
same day are scored against the observed state on different
days: five, ten, and thirty days later. This is particularly
serious for the RPS and MAE scores, whose attainable range depends
directly (independent of the forecast) on the observed state.
Second, while it was hoped that the six models constructed by six
different analysts would provide a reasonable sample of the
possible PREDICT applications, five of the six models exhibited
almost no state changes and nearly perfect forecasts by both
analyst and computer, so only the single Alpha model remains as
our sample. Thus generalizations cannot be supported.

To assist in interpreting the magnitude of the observed differences
in RPS, consider as a baseline forecasting system one
which assesses each day a uniform forecast across the five
states. Such a system assumes that the five states represent
reasonable possibilities for the particular problem but cannot
say which are more likely than others. The formula for RPS in
such a case reduces to

$$ RPS_j(0.2, 0.2, 0.2, 0.2, 0.2) = 1.2 - 0.05 \sum_{i=1}^{5} |i - j| . $$

For the Alpha model the resulting RPS scores would be 0.795,
0.791, and 0.776 for the 5-, 10-, and 30-day periods, respectively.
If the baseline score is subtracted from each RPS score for
the Alpha model and the result is divided by one minus the
baseline score, the scores produced represent the fraction of
possible improvement above the baseline achieved by the different
forecasts. Such "normalized" scores are shown in Table 3.
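The three baseline values quoted above can be reproduced, under the assumption that each is the observation-weighted average of the uniform-forecast RPS over the states actually observed in that forecast period (counts taken from Table 2). The names below are hypothetical.

```python
# Observations in states 1..5 for Model Alpha, copied from Table 2.
ALPHA_COUNTS = {
    "5-day":  [0, 2, 12, 35, 35],
    "10-day": [0, 0, 9, 36, 34],
    "30-day": [0, 1, 6, 21, 31],
}


def uniform_rps(j):
    """RPS of the uniform five-state forecast when state j occurs."""
    return 1.2 - 0.05 * sum(abs(i - j) for i in range(1, 6))


def baseline_mean_rps(counts):
    """Average uniform-forecast RPS, weighted by how often each state occurred."""
    total = sum(counts)
    weighted = sum(n * uniform_rps(j) for j, n in enumerate(counts, start=1))
    return weighted / total


for period, counts in ALPHA_COUNTS.items():
    print(period, round(baseline_mean_rps(counts), 3))
```

Rounded to three places, this weighting gives the 0.795, 0.791, and 0.776 stated in the text, which supports the assumed reading of how the baseline was formed.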

Table 3: Alpha RPS Scores Renormalized

                                   Forecast Period
                          5-day              10-day              30-day
                      aided  unaided     aided  unaided     aided  unaided

mean RPS'             0.454   0.580      0.492   0.612      0.518   0.433
std error of mean     0.004   0.006      0.003   0.005      0.006   0.004

From the table, for the five-day forecasts, the PREDICT-generated
forecast achieves 45.4% of the potential improvement over
the baseline uniform forecast, while the unaided analyst
achieves 58.0%. Thus either the aided or the unaided
forecasts are worth on the order of half the amount one should
be willing to pay for a perfect forecasting system. To the
extent that the baseline system is a pessimistic baseline, these
percentages should, of course, be less.
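The renormalization behind Table 3 can be sketched as follows. The function name is hypothetical, and the inputs are the rounded five-day values printed in Table 2 and in the text, so results computed this way may differ from Table 3 in the third decimal place.

```python
def fraction_of_possible_improvement(score, baseline):
    """(score - baseline) / (1 - baseline): 0 matches the baseline, 1 is perfect."""
    return (score - baseline) / (1.0 - baseline)


FIVE_DAY_BASELINE = 0.795  # uniform-forecast RPS for the 5-day period

aided = fraction_of_possible_improvement(0.888, FIVE_DAY_BASELINE)
unaided = fraction_of_possible_improvement(0.914, FIVE_DAY_BASELINE)
print(f"aided {aided:.3f}, unaided {unaided:.3f}")
```

With these inputs the two fractions come out at roughly 0.45 and 0.58, in line with the five-day entries of Table 3.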

Summary results for models Beta through Lambda are shown in
Tables 4 through 8. In all cases, for these models, forecasts
are so nearly perfect that these data are not considered further.
For the models with little or no state-changing activity
studied here, both the unaided analysts' direct probability
Table 5: Summary Data for Model Gamma

                                   Forecast Period
                          5-day              10-day          30-day
                      aided  unaided     aided  unaided

mean RPS               .923    .955       .876    .919        .925
std error of mean      .011    .005       .011    .011        .005
mean PS                .989    .999       .982    .998        .997
std error of mean      .002    .000       .002    .000        .000
mean MAE               .682    .471      1.024    .682

distrib of observations:
  state 1                  0                  0
  state 2                 23                 22
  state 3                 24                 20
  state 4                  0                  0
  state 5                  0                  0
  Total                   47                 42


Table 6: Summary Data for Model Delta

                                   Forecast Period
                          5-day              10-day              30-day
                      aided  unaided     aided  unaided     aided  unaided

mean RPS              1.000   1.000      1.000    .999      1.000    .995
std error of mean      .000    .000       .000    .000       .000    .000
mean PS               1.000   1.000      1.000   1.000      1.000   1.000
std error of mean      .000    .000       .000    .000       .000    .000
mean MAE               .000    .051       .000    .120       .000    .209

distrib of observations:
  state 1                101                101                 97
  state 2                  0                  0                  0
  state 3                  0                  0                  0
  state 4                  0                  0                  0
  state 5                  0                  0                  0
  Total                  101                101                 97


Table 7: Summary Data for Model Epsilon

                                   Forecast Period
                          5-day              10-day              30-day
                      aided  unaided     aided  unaided     aided  unaided

mean RPS              1.000    .967      1.000    .963      1.000    .947
std error of mean      .000    .002       .000    .002       .000    .004
mean PS               1.000    .877      1.000    .864       .998    .798
std error of mean      .000    .009       .000    .009       .000    .018
mean MAE               .000    .408       .000    .434       .040    .498

distrib of observations:
  state 1                  0                  0                  0
  state 2                  0                  0                  0
  state 3                  0                  0                  0
  state 4                126                121                102
  state 5                  0                  0                  0
  Total                  126                121                102

Table 8: Summary Data for Model Lambda

                                   Forecast Period
                          5-day              10-day          30-day
                      aided  unaided     aided  unaided      aided

mean RPS               .935    .959       .903    .956        .867
std error of mean      .002    .002       .001    .002        .002
mean PS               1.000   1.000      1.000   1.000       1.000
std error of mean      .000    .000       .000    .000        .000
mean MAE               .525    .449       .644    .468        .830

distrib of observations:
  state 1                  0                  0
  state 2                102                 98
  state 3                  6                  5
  state 4                  0                  0
  state 5                  0                  0
  Total                  108                103


5.0  CONCLUSIONS 


Too many of the models constructed turned out to be too easy for
the analysts (and the PREDICT software) to forecast. In
these instances, the PREDICT system succeeded in producing
nearly perfect forecasts; thus, while the rather complex
mathematical model was not afforded an opportunity to do better
than the analysts' direct assessments, it did pass a less
stringent test by not performing markedly worse than the unaided
analysts.

In the case of the one model which had sufficient difficulty and
exhibited frequent state changes (Alpha), the unaided forecasts
were superior to the PREDICT-generated forecasts for the 5-day
and 10-day forecasts, and the PREDICT forecast was superior for
the 30-day forecasts. The sizes of observed effects are such
that the PREDICT model would appear to be able to substitute
successfully for an analyst during periods of absence or overload,
in cases similar to those modeled here.

The  data  provide  a  suggestion  that  for  longer-term  forecasts  in 
dynamic  situations  the  PREDICT  system  could  serve  to  improve  the 
quality  of  analysts’  forecasts.  For  the  one  case  observed,  the 
magnitude  of  the  improvement  from  computer-aiding  is  20%  of  the 
value  of  the  unaided  forecast  when  using  a  baseline  uniform 
forecast  for  comparison. 
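The computation behind the 20% figure is not shown. One reading, offered here only as an assumption, takes the 30-day renormalized scores from Table 3 and expresses the aided-minus-unaided difference relative to the unaided score:

```latex
\[
\frac{\mathrm{RPS}'_{\text{aided}} - \mathrm{RPS}'_{\text{unaided}}}
     {\mathrm{RPS}'_{\text{unaided}}}
  = \frac{0.518 - 0.433}{0.433} \approx 0.196 \approx 20\%.
\]
```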
APPENDIX: DATA TABLES

A.1 Format of Tables

The data for unaided forecasts are presented in relatively
unprocessed form in Subsection A.2, with the corresponding data
for computer-generated forecasts in Subsection A.3. The formats
of both data files are identical. To illustrate concretely,
consider page 1 of the listing of file "ALPHA.UFN"
(Alpha-unaided). Data are presented on one line for each day. The
first line is always 1 June 1981. Each successive line is the
following day. For each line, 17 fields (items) of varying
width are presented. Field one is a three-digit date (month and
day); it is shown only for those days on which probability
forecasts were collected from the analyst. The first date on
which forecasts were generated was June 23. The second field is
the state actually observed by the analyst on that date. Thus
model Alpha was reported to be in state 3 on June 23.

The remaining fifteen 3-digit fields show the probabilities (on
a 0 to 100 integer scale) assessed for that date: the first five
fields correspond to the probabilities of states one through
five forecast 5 days earlier, the next five to the forecast
produced 10 days earlier, and the last five to the forecast
generated 30 days earlier, all forecasts pertaining to the date
corresponding to the particular line. Five zeroes for a forecast
indicate that no forecast was collected. Thus for June 23 no
forecasts were available for the states on that date. For June
28, a forecast produced five days earlier indicated probabilities
of 0.01 each for states 1 and 2, 0.4( for state 3 (the
state actually observed on that date), 0.59 for state 4, and
0.08 for state 5; no forecasts for June 28 were available
corresponding to 10-day and 30-day intervals. Dates are shown as
zero for days on which forecasts were not collected; thus no
forecasts were generated on June 28. In general, even when
forecasts were not generated, the observed states for those days
were estimated at the next forecasting session.
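The record layout described above can be sketched as a small parser. This is an illustration only: it assumes the 17 fields on a line are separated by whitespace, whereas the actual listings use fields of varying fixed width, and the names `DayRecord` and `parse_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class DayRecord:
    date: Optional[Tuple[int, int]]             # (month, day); None if field is zero
    state: int                                  # state observed on this day (1..5)
    forecasts: Dict[int, Optional[List[float]]]  # lead time in days -> 5 probabilities


def parse_line(line: str) -> DayRecord:
    """Parse one day's record: date, observed state, then 3 x 5 probabilities."""
    fields = [int(tok) for tok in line.split()]
    if len(fields) != 17:
        raise ValueError(f"expected 17 fields, got {len(fields)}")
    date_code, state = fields[0], fields[1]
    # A date such as 623 means month 6, day 23; zero means no forecasts collected.
    date = divmod(date_code, 100) if date_code else None
    forecasts: Dict[int, Optional[List[float]]] = {}
    for k, lead in enumerate((5, 10, 30)):
        block = fields[2 + 5 * k: 7 + 5 * k]
        # Five zeroes mean that no forecast was collected for this lead time.
        forecasts[lead] = [p / 100 for p in block] if any(block) else None
    return DayRecord(date, state, forecasts)


# June 23 as described in the text: state 3 observed, no earlier forecasts.
june_23 = parse_line("623 3 " + " ".join(["0"] * 15))
print(june_23)
```

Records parsed this way could be passed to scoring functions for whichever lead times have a forecast, skipping entries that are `None`.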

A.2 Data for Unaided Forecasts

A.2.1 Alpha.